
Improve Mistral models integration with llama.cpp #14737


Draft · wants to merge 1 commit into master

Conversation


@juliendenize commented Jul 17, 2025

Description

This PR aims to enhance the integration of Mistral models with llama.cpp by addressing several key issues and introducing new features. Here are the details:

Context

  • The current HF-to-GGUF conversion does not work directly for Mistral models because our weight format is vLLM-based. Weights must first be converted to the Hugging Face format and then to GGUF, which is not ideal and can introduce conversion errors if the first step is done incorrectly. It also means that adding new models to the llama.cpp ecosystem requires first adding them to Transformers.
  • We do not support chat templates natively, which means chat templates are community-maintained and not guaranteed to work correctly.
  • We use mistral-common internally for tokenization and want the community to use it to unlock the full capabilities of our models. As mistral-common is a Python library, we have opened a PR to add a REST API via FastAPI so that users outside the Python ecosystem can use it more easily; a minimal sketch of the library usage follows below.
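
As an illustration of the last point, here is a minimal sketch of what tokenization with the mistral-common Python library looks like. The v3 tokenizer is used purely as an example; pick the version matching your model.

from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load one of the released tokenizers (v3 chosen for illustration).
tokenizer = MistralTokenizer.v3()

# Encode a chat request into the exact token ids the model expects.
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="Who are you?")])
)
print(tokenized.tokens[:10])
print(tokenized.text)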

Using mistral-common with llama.cpp

We recommend that users only use the llama-server tool with the /completions route of the server for now, as it is the only route that supports token input. We also advise users to set return_tokens=True in their requests so that mistral-common handles detokenization.
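
For illustration, assuming a llama-server instance on port 8080 and token ids already produced by mistral-common, a request following this recommendation could look roughly like:

import requests

# Token ids for the prompt, as returned by mistral-common (placeholder values here).
prompt_tokens = [1, 3, 1010, 1032, 4]

response = requests.post(
    "http://127.0.0.1:8080/completions",
    json={
        "prompt": prompt_tokens,  # pass token ids instead of text
        "stream": False,
        "return_tokens": True,    # get generated token ids back for detokenization
    },
)
print(response.json()["tokens"])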

Added features

  1. Model conversion:

We have added a script, convert_mistral_to_gguf.py, that converts Mistral models to GGUF directly from the Hugging Face Hub.

  2. Model architecture:

We registered the Mistral architecture in llama.cpp to support Mistral models natively. This allows users to use Mistral models with llama.cpp without having to convert them to Hugging Face first.

Known Limitations:

Our approach does not support multimodality:

  • mistral-common handles multimodal inputs, but they cannot be passed to llama.cpp via the /completions route.
  • llama.cpp only supports multimodality via chat templates, which we do not support.

This approach also requires users to interact with the llama.cpp server only through the /completions route.

Example Code

To get started, follow the steps below.

(Optional) Convert the model

HF_TOKEN=... python convert_mistral_to_gguf.py \
mistralai/Devstral-Small-2505 --remote --ctx-train 131072 --outtype bf16
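
To sanity-check the converted file, one option (not part of this PR) is the gguf Python package; a minimal sketch, where the path is whatever the script wrote on your machine:

from gguf import GGUFReader  # pip install gguf

# Adjust the path to the file produced by the conversion script.
reader = GGUFReader("Devstral-Small-2505-bf16.gguf")

# Print the metadata keys and the tensor count as a quick sanity check.
for key in reader.fields:
    print(key)
print("tensor count:", len(reader.tensors))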

Launch the mistral-common and llama.cpp servers

First, install mistral-common with its server extra:

pip install git+https://github.com/mistralai/mistral-common.git@improve_llama_cpp_integration[server]

Launch the mistral-common server:

HF_TOKEN=... mistral_common mistralai/Devstral-Small-2505 --port 6000

Launch the llama.cpp server:

./build/bin/llama-server -m models/Devstral-Small-2505-Q4_K_M.gguf --port 8080
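
Before sending requests, it can help to wait until the llama.cpp server reports ready; a small sketch using its /health endpoint (whether the mistral-common server exposes a similar health route is not covered here):

import time
import requests

# Poll the llama.cpp server until the model is loaded and /health answers with 200.
for _ in range(60):
    try:
        if requests.get("http://127.0.0.1:8080/health", timeout=1).status_code == 200:
            print("llama.cpp server is ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)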

Use the servers

Here is a code snippet demonstrating how to use the new features:

import requests

# Endpoints of the mistral-common tokenization server and the llama.cpp server.
mistral_common_url = "http://127.0.0.1:6000"
llama_cpp_url = "http://127.0.0.1:8080"

def tokenize(messages, url):
    # Ask mistral-common to render the chat messages into token ids.
    response = requests.post(f"{url}/tokenize/messages", json=messages)
    return response.json()

def detokenize(tokens, url):
    # Convert generated token ids back into raw text.
    response = requests.post(f"{url}/detokenize", json={"tokens": tokens})
    return response.json()

def detokenize_message(tokens, url):
    # Convert generated token ids back into a structured assistant message.
    response = requests.post(f"{url}/detokenize", json={"tokens": tokens, "as_message": True})
    return response.json()

def generate(tokens, url):
    # Ask the llama.cpp server to complete from token ids, returning token ids as well.
    response = requests.post(f"{url}/completions", json={
        "prompt": tokens,
        "stream": False,
        "return_tokens": True
    })
    return response.json()

messages = [
    {"role": "system", "content": "You are Devstral a cool coding agent that can help users with their coding needs."},
    {"role": "user", "content": "Who are you and what can you do?"}
]

tokens = tokenize(messages, mistral_common_url)
print(tokens)

generated = generate(tokens, llama_cpp_url)["tokens"]
print(generated)

detokenized = detokenize(generated, mistral_common_url)
print(detokenized)

detokenized_message = detokenize_message(generated, mistral_common_url)
print(detokenized_message)
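
Building on the snippet above, a follow-up turn can reuse the detokenized assistant message; this assumes the message returned by /detokenize with as_message=True can be appended to the conversation as-is:

# Continue the conversation: append the assistant reply and a new user turn,
# then tokenize and generate again through the same two servers.
messages.append(detokenized_message)
messages.append({"role": "user", "content": "Write a one-line Python hello world."})

tokens = tokenize(messages, mistral_common_url)
generated = generate(tokens, llama_cpp_url)["tokens"]
print(detokenize(generated, mistral_common_url))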

Feedback and Contributions

We believe these changes will significantly improve the integration of Mistral models with llama.cpp and provide a better experience for our users. We welcome any feedback or suggestions to further enhance this integration. Also, as we have little experience with the llama.cpp codebase, we welcome any help to improve the integration and to make sure we respect the codebase and the community.

@ggerganov (Member) commented:

Thanks for the contribution. From a developer perspective, it looks like a good approach to avoid any potential tokenization / formatting problems. In general, for all models, using a reference tokenizer instead of relying on llama.cpp is always recommended. From a usability standpoint, the requirement to start a separate tokenization server is a bit of a drawback, but I understand that correctness is of higher importance.

My understanding is that most chat template problems occur during the early days of the model release, and with time tend to get polished and fixed. So this approach would be a stable alternative during such periods of instability.

@ehoogeveen-medweb commented:

IIRC Mistral's architecture also makes use of sliding window attention (SWA), defaulting to a window size of 4096 tokens - though I don't know all the details (like which layers, if any, are full layers). It would be great if the window size could be stored in the GGUF file as well (e.g. as mistral.attention.sliding_window), and the model could eventually be hooked into llama.cpp's SWA support.
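
For illustration only, here is a rough sketch (not from this PR) of how a conversion script could record such a key with the gguf Python package, using the key name and default value from the comment above:

import numpy as np
from gguf import GGUFWriter

# Hypothetical metadata-only example: write the suggested sliding-window key.
writer = GGUFWriter("sliding-window-demo.gguf", "mistral")
writer.add_uint32("mistral.attention.sliding_window", 4096)
writer.add_tensor("dummy", np.zeros((4, 4), dtype=np.float32))  # placeholder tensor

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()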
